In this homework, we examine sales data from Trendyol for nine different products. The data include products from various categories, and each product is examined separately. For each product, we try to find a pattern that the sales follow and examine the general trend. We use this pattern to decompose the data, and then fit AR, MA, and combined ARMA models to determine the best-fitting prediction method. Such sales-based predictions are important for companies planning their pricing or inventories; this is a simplified version of work that many firms carry out in real time. Some of the regressors in the data, such as overall visits to the Trendyol website, were not accurate, so we avoid including such variables in our models.
library(data.table)
library(ggplot2)
library(skimr)
library(ggcorrplot)
library(GGally)
library(forecast)
library(lubridate)
library(urca)
library(dplyr)   # needed for the %>% pipelines in the second half of the report
library(tsibble) # provides yearweek() and yearmonth() used in the aggregations
load("C:/Users/Asus/Desktop/Spring 2021/IE 360/hw4-5/homework 4-5.RData")
In this part of the analysis, the series are decomposed at different levels.
tsdisplay(ts(xiaomi$sold_count))
First, the series is plotted to detect possible seasonal periods. As can be seen from the autocorrelation plot below, the autocorrelation is highest at lag 32.
acf(xiaomi$sold_count, lag.max = 50)
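The dominant seasonal lag can also be located programmatically rather than read off the plot (a small sketch, assuming the `xiaomi` table loaded above):

```r
# compute autocorrelations without plotting and report the strongest non-zero lag
acf_vals <- acf(xiaomi$sold_count, lag.max = 50, plot = FALSE)
which.max(acf_vals$acf[-1])  # drop lag 0; the index equals the lag number
```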
The data are first decomposed at a weekly level with frequency 7; the results are displayed below. Additive decomposition is chosen since the variance of the series is not affected by the level of the time series. The random component is stationary with a mean close to zero and constant variance. Also, based on the result of the KPSS test, the random component is stationary.
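The decomposition object `xiaomi_ts_week_dec` used below is not constructed in the code shown; it was presumably built along these lines (a sketch under that assumption):

```r
# build a ts object with weekly frequency and decompose it additively
xiaomi_ts_week <- ts(xiaomi$sold_count, frequency = 7)
xiaomi_ts_week_dec <- decompose(xiaomi_ts_week, type = "additive")
```

The frequency-32 object `xiaomi_ts_year_dec` used later would differ only in the `frequency` argument.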
plot(xiaomi_ts_week_dec)
summary(ur.kpss(xiaomi_ts_week_dec$random))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0112
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Second, the data are decomposed with frequency 32; the results are displayed below. Additive decomposition is chosen since the variance of the series is not affected by the level of the time series. The random component is stationary with a mean close to zero and constant variance. Also, based on the result of the KPSS test, the random component is stationary.
plot(xiaomi_ts_year_dec)
summary(ur.kpss(xiaomi_ts_year_dec$random))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0131
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Since the random component from the first decomposition looks more stationary and its KPSS results are better, it is chosen for model construction.
The ACF and PACF plots indicate moving-average behavior: the PACF decays exponentially, while the ACF spikes and becomes insignificant after some point.
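The series `xiaomi_random` analyzed below is presumably the remainder of the weekly decomposition (a one-line sketch under that assumption):

```r
xiaomi_random <- xiaomi_ts_week_dec$random  # what is left after removing trend and seasonality
```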
acf(xiaomi_random, na.action = na.pass, lag.max = 50)
pacf(xiaomi_random, na.action = na.pass, lag.max = 50)
After trying several alternatives, the model below is chosen with parameters p = 2 and q = 4. The chosen model has the lowest AIC value among the alternatives.
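The search over alternatives is not shown; it can be sketched as a small grid search over candidate (p, q) orders (hypothetical helper code, assuming `xiaomi_random` from the decomposition above):

```r
# fit ARMA(p, 0, q) candidates on the random component and compare AIC values
orders <- expand.grid(p = 0:3, q = 0:4)
aics <- apply(orders, 1, function(o) {
  tryCatch(AIC(arima(xiaomi_random, order = c(o["p"], 0, o["q"]))),
           error = function(e) NA)  # some orders may fail to converge
})
orders[which.min(aics), ]  # order with the lowest AIC
```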
summary(m3)
##
## Call:
## arima(x = xiaomi_random, order = c(2, 0, 4))
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 ma4 intercept
## -0.1073 -0.1799 -0.0238 -0.1873 -0.6167 -0.1722 -0.0305
## s.e. NaN 0.0819 NaN NaN NaN NaN 0.0989
##
## sigma^2 estimated as 9381: log likelihood = -2333.65, aic = 4683.3
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -1.117634 96.85467 61.68407 89.07912 177.5698 0.6525615
## ACF1
## Training set -0.003936123
Residuals of the model largely satisfy the model assumptions: they are distributed approximately normally with zero mean and roughly constant variance. However, the autocorrelations at lags 4 and 5 are slightly significant, and the Ljung-Box test below rejects the hypothesis of uncorrelated residuals (p ≈ 0.002).
checkresiduals(m3)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,4) with non-zero mean
## Q* = 23.013, df = 7, p-value = 0.001696
##
## Model df: 7. Total lags used: 14
Fitted (red) and real (black) values are plotted below.
ggplot(xiaomi, aes(x = event_date))+
geom_line(aes(y = Model1), color = 'red')+
geom_line(aes(y = sold_count), color = 'black')
Finally, results from the decomposition and the ARIMA model are combined. Overall, the residuals satisfy the model assumptions.
ggplot(xiaomi, aes(x = event_date))+
geom_line(aes(y = sold_count - Model1))
## Warning: Removed 6 row(s) containing missing values (geom_path).
acf(xiaomi$sold_count-xiaomi$Model1, na.action = na.pass)
ggpairs(xiaomi[,c(4,18:20)])
# plot(xiaomi$price) # ok
# plot(xiaomi$visit_count)
# plot(xiaomi$basket_count) # ok
# plot(xiaomi$favored_count)
# plot(xiaomi$category_sold) # not ok unusual behavior
# plot(xiaomi$category_basket)
# plot(xiaomi$category_visits)
# plot(xiaomi$category_favored) # ok
# plot(xiaomi$category_brand_sold)
Only price with lag 1, basket count with lag 2, and category favored with lag 2 can be considered as potential regressors, since the rest of the variables are either missing or dirty.
Among these alternatives, price with lag 1 and basket count with lag 2 are highly correlated with the sales data.
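The lagged columns `price_lag1` and `basket_count_lag2` are not created in the code shown; with `data.table` they can be built as follows (a sketch, assuming the original column names):

```r
# shift() from data.table lags a column; the first value(s) become NA
xiaomi[, price_lag1            := shift(price, 1)]
xiaomi[, basket_count_lag2     := shift(basket_count, 2)]
xiaomi[, category_favored_lag2 := shift(category_favored, 2)]
```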
At this stage of the analysis, the external regressors determined above are added to the model. The results are as follows.
summary(m2)
##
## Call:
## arima(x = xiaomi_random, order = c(3, 0, 2), xreg = matrix(c(xiaomi$price_lag1,
## xiaomi$basket_count_lag2), byrow = FALSE, ncol = 2))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 intercept
## 0.3538 0.1219 -0.3731 -0.5472 -0.4528 -11.1106
## s.e. 0.1637 0.1367 0.0695 0.1723 0.1722 2.5258
## matrix(c(xiaomi$price_lag1, xiaomi$basket_count_lag2), byrow = FALSE, ncol = 2)1
## 0.0381
## s.e. 0.0188
## matrix(c(xiaomi$price_lag1, xiaomi$basket_count_lag2), byrow = FALSE, ncol = 2)2
## 0.0038
## s.e. NaN
##
## sigma^2 estimated as 9041: log likelihood = -2326.78, aic = 4671.55
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -2.0912 95.08518 61.9725 48.4812 217.848 0.6556128 -0.001766772
checkresiduals(m2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,2) with non-zero mean
## Q* = 19.462, df = 6, p-value = 0.003451
##
## Model df: 8. Total lags used: 14
The ARIMA model mostly satisfies the model assumptions. There is a slightly significant autocorrelation at lag 6, but it is not highly significant.
Finally, the decomposition and ARIMA results are combined. Fitted and real values are plotted below.
ggplot(xiaomi, aes(x = event_date))+
geom_line(aes(y = Model2), color = 'red')+
geom_line(aes(y = sold_count), color = 'black')
Model assumptions are satisfied: the residuals are distributed with mean zero and constant variance, and they have no significant autocorrelation.
ggplot(xiaomi, aes(x = event_date))+
geom_line(aes(y = sold_count - Model2))
acf(xiaomi$sold_count-xiaomi$Model2, na.action = na.pass)
frame_xiaomi
## n mean sd CV FBias MAPE RMSE MAD
## 1 7 300.1429 53.28986 0.1775483 -0.3820318 0.4141282 122.3179 114.6641
## 2 7 300.1429 53.28986 0.1775483 -0.3386647 0.3727282 110.1031 101.6478
## MADP WMAPE WMAPE2
## 1 0.3820318 0.3820318 0.3820318
## 2 0.3386647 0.3386647 0.3386647
Finally, the two models are tested over a period of one week. The second model, the one with external regressors, performs better as it has a lower WMAPE value.
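The comparison table `frame_xiaomi` is presumably produced by applying an accuracy helper, similar to the `accu()` function defined later in this report, to each model's forecasts for the final week (a hypothetical sketch; the variant used here evidently also reports a WMAPE2 column):

```r
test_idx <- (nrow(xiaomi) - 6):nrow(xiaomi)  # the final 7 days
frame_xiaomi <- rbind(
  accu(xiaomi$sold_count[test_idx], xiaomi$Model1[test_idx]),
  accu(xiaomi$sold_count[test_idx], xiaomi$Model2[test_idx])
)
```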
tsdisplay(ts(fakir$sold_count))
First, the series is plotted to detect possible seasonal periods. As can be seen from the autocorrelation plot below, the autocorrelation is highest at lag 16.
acf(fakir$sold_count, lag.max = 50)
The data are first decomposed at a weekly level with frequency 7; the results are displayed below. Additive decomposition is chosen since the variance of the series is not affected by the level of the time series. The random component is stationary with a mean close to zero and roughly constant variance, although the variance increases slightly through the middle of the series. Also, based on the result of the KPSS test, the random component is stationary.
plot(fakir_ts_week_dec)
summary(ur.kpss(fakir_ts_week_dec$random))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0063
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Second, the data are decomposed with frequency 16; the results are displayed below. The random component is stationary with a mean close to zero and constant variance. Also, based on the result of the KPSS test, the random component is stationary.
Since the random component from the first decomposition looks more stationary and its KPSS results are better, it is chosen for model construction.
plot(fakir_ts_year_dec)
summary(ur.kpss(fakir_ts_year_dec$random))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0209
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The ACF and PACF plots indicate moving-average behavior: the PACF decays exponentially, while the ACF spikes and becomes insignificant after some point. However, there is a slight increase in autocorrelation at lag 16.
acf(fakir_random, na.action = na.pass, lag.max = 50)
pacf(fakir_random, na.action = na.pass, lag.max = 50)
After trying several alternatives, the model below is chosen with parameter q = 4. The chosen model has the lowest AIC value among the alternatives. This order is also implied by the ACF plot, as the autocorrelation decreases considerably after lag 4.
summary(m1_fakir)
##
## Call:
## arima(x = fakir_random, order = c(0, 0, 4), include.mean = FALSE)
##
## Coefficients:
## ma1 ma2 ma3 ma4
## -0.0838 -0.1935 -0.4385 -0.2842
## s.e. 0.0473 0.0446 0.0423 0.0508
##
## sigma^2 estimated as 259.2: log likelihood = -1635.42, aic = 3280.84
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.1586474 16.10023 8.611572 67.98593 166.7596 0.6871277 0.01105164
Residuals of the model largely satisfy the model assumptions: they are distributed approximately normally with zero mean and roughly constant variance. However, the autocorrelation at lag 16 is slightly significant, and the distribution of the residuals is more concentrated around zero than a normal distribution.
checkresiduals(m1_fakir)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,4) with zero mean
## Q* = 30.738, df = 10, p-value = 0.000648
##
## Model df: 4. Total lags used: 14
Fitted (red) and real (black) values are plotted below.
ggplot(fakir, aes(x = event_date))+
geom_line(aes(y = Model1), color = 'red')+
geom_line(aes(y = sold_count), color = 'black')
Finally, results from the decomposition and the ARIMA model are combined. Overall, the residuals mostly satisfy the model assumptions. The variance of the series increases during the middle of the series; the autocorrelation is not significant.
ggplot(fakir, aes(x = event_date))+
geom_line(aes(y = sold_count - Model1))
acf(fakir$sold_count-fakir$Model1, na.action = na.pass)
ggpairs(fakir[,c(4,18:21)])
# plot(fakir$price) # ok
# plot(fakir$visit_count)
# plot(fakir$basket_count) # ok
# plot(fakir$favored_count)
# plot(fakir$category_sold) # ok
# plot(fakir$category_basket)
# plot(fakir$category_visits)
# plot(fakir$category_favored) # ok
# plot(fakir$category_brand_sold)
Only price with lag 1, basket count with lag 2, category sold with lag 2, and category favored with lag 2 can be considered as potential regressors, since the rest of the variables are either missing or dirty.
Among these alternatives, all variables except category sold with lag 2 are highly correlated with the sales data.
At this stage of the analysis, the external regressors determined above are added to the model. The results are as follows.
summary(m2_fakir)
##
## Call:
## arima(x = fakir_random, order = c(3, 0, 1), xreg = matrix(c(fakir$price_lag1,
## fakir$basket_count_lag2, fakir$category_favored_lag2), byrow = FALSE, ncol = 3),
## include.mean = FALSE)
##
## Coefficients:
## ar1 ar2 ar3 ma1
## 0.8565 -0.1345 -0.2929 -1.0000
## s.e. 0.0484 0.0646 0.0484 0.0067
## matrix(c(fakir$price_lag1, fakir$basket_count_lag2, fakir$category_favored_lag2), byrow = FALSE, ncol = 3)1
## -8e-04
## s.e. 1e-04
## matrix(c(fakir$price_lag1, fakir$basket_count_lag2, fakir$category_favored_lag2), byrow = FALSE, ncol = 3)2
## 6e-04
## s.e. 4e-04
## matrix(c(fakir$price_lag1, fakir$basket_count_lag2, fakir$category_favored_lag2), byrow = FALSE, ncol = 3)3
## 0
## s.e. NaN
##
## sigma^2 estimated as 241.1: log likelihood = -1621.87, aic = 3259.75
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.2120418 15.5258 8.705402 27.66236 257.1356 0.6946145 0.003801097
checkresiduals(m2_fakir)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,1) with zero mean
## Q* = 29.123, df = 7, p-value = 0.0001374
##
## Model df: 7. Total lags used: 14
The ARIMA model mostly satisfies the model assumptions. There is a slightly significant autocorrelation at lag 16, but it is not highly significant.
Finally, the decomposition and ARIMA results are combined. Fitted and real values are plotted below.
ggplot(fakir, aes(x = event_date))+
geom_line(aes(y = Model2), color = 'red')+
geom_line(aes(y = sold_count), color = 'black')
Model assumptions are satisfied: the residuals are distributed with mean zero and constant variance, and they have no significant autocorrelation. However, there are some outlier observations where the errors get larger.
ggplot(fakir, aes(x = event_date))+
geom_line(aes(y = sold_count - Model2))
acf(fakir$sold_count-fakir$Model2, na.action = na.pass)
Finally, the two models are tested over a period of one week. The first model, the one without external regressors, performs better as it has a lower WMAPE value.
frame_fakir
## n mean sd CV FBias MAPE RMSE MAD
## 1 7 300.1429 53.28986 0.1775483 -0.3400742 0.3762535 113.3208 102.0708
## 2 7 300.1429 53.28986 0.1775483 -0.3740729 0.4064323 120.4035 112.2753
## MADP WMAPE WMAPE2
## 1 0.3400742 0.3400742 0.3400742
## 2 0.3740729 0.3740729 0.3740729
tsdisplay(mont_ts_week)
First, the series is plotted to detect possible seasonal periods. As can be seen from the autocorrelation plot below, the autocorrelation is highest at lag 20.
acf(mont$sold_count, lag.max = 50)
The data are first decomposed at a weekly level with frequency 7; the results are displayed below. Additive decomposition is chosen since the variance of the series is not affected by the level of the time series. The variance of the random component increases around the outlier observations in the middle of the series; otherwise, the random component is stationary with a mean close to zero and constant variance. Also, based on the result of the KPSS test, the random component is stationary.
plot(mont_ts_week_dec)
summary(ur.kpss(mont_ts_week_dec$random))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0087
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Second, the data are decomposed with frequency 20; the results are displayed below. The random component is stationary with a mean close to zero and constant variance. Also, based on the result of the KPSS test, the random component is stationary.
Since the random component from the first decomposition looks more stationary and its KPSS results are better, it is chosen for model construction.
plot(mont_ts_year_dec)
summary(ur.kpss(mont_ts_year_dec$random))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0152
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ACF and PACF for the sales data are plotted below.
acf(mont_random, na.action = na.pass, lag.max = 50)
pacf(mont_random, na.action = na.pass, lag.max = 50)
After trying several alternatives, the model below is chosen with parameter q = 5. The chosen model has the lowest AIC value among the alternatives. This order is also consistent with the ACF and PACF plots, which indicate moving-average behavior.
summary(m1_mont)
##
## Call:
## arima(x = mont_random, order = c(0, 0, 5), include.mean = FALSE)
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5
## -0.5254 -0.6688 -0.3011 0.1166 0.3787
## s.e. 0.0478 0.0536 0.0599 0.0534 0.0432
##
## sigma^2 estimated as 3.678: log likelihood = -809.77, aic = 1631.54
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.009715837 1.91781 0.7620904 4183.206 4623.154 0.6267887
## ACF1
## Training set 0.01107999
Residuals of the model only partially satisfy the model assumptions: they are distributed approximately normally with zero mean and roughly constant variance, but more concentrated around zero than a normal distribution. Also, the autocorrelation at lag 19 is slightly significant.
checkresiduals(m1_mont)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,5) with zero mean
## Q* = 12.25, df = 9, p-value = 0.1996
##
## Model df: 5. Total lags used: 14
Fitted (red) and real (black) values are plotted below.
ggplot(mont, aes(x = event_date))+
geom_line(aes(y = Model1), color = 'red')+
geom_line(aes(y = sold_count), color = 'black')
Finally, results from the decomposition and the ARIMA model are combined. Overall, the residuals mostly satisfy the model assumptions. The variance of the series increases during the middle of the series; the autocorrelation is not significant.
ggplot(mont, aes(x = event_date))+
geom_point(aes(y = sold_count - Model1))
acf(mont$sold_count-mont$Model1, na.action = na.pass)
ggpairs(mont[,c(4,18:21)])
# plot(mont$price) # ok
# plot(mont$visit_count)
# plot(mont$basket_count) # ok
# plot(mont$favored_count)
# plot(mont$category_sold) # ok
# plot(mont$category_basket)
# plot(mont$category_visits)
# plot(mont$category_favored) # ok
# plot(mont$category_brand_sold)
Only price with lag 1, basket count with lag 2, category sold with lag 2, and category favored with lag 2 can be considered as potential regressors, since the rest of the variables are either missing or dirty.
Among these alternatives, only basket count with lag 2 is highly correlated with the sales data.
At this stage of the analysis, the external regressor determined above is added to the model. The results are as follows.
summary(m2_mont)
##
## Call:
## arima(x = mont_random, order = c(3, 0, 2), xreg = matrix(c(mont$basket_count_lag2),
## byrow = FALSE, ncol = 1))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 intercept
## 1.3572 -0.8529 0.1953 -1.8968 0.8969 -0.002
## s.e. 0.0622 0.0782 0.0573 0.0378 0.0375 NaN
## matrix(c(mont$basket_count_lag2), byrow = FALSE, ncol = 1)
## 3e-04
## s.e. NaN
##
## sigma^2 estimated as 3.652: log likelihood = -808.55, aic = 1633.1
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01453231 1.911118 0.773554 1279.604 4015.209 0.6362171
## ACF1
## Training set -0.01120829
checkresiduals(m2_mont)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,2) with non-zero mean
## Q* = 21.6, df = 7, p-value = 0.002977
##
## Model df: 7. Total lags used: 14
In the analysis of the residuals, we see that the variance of the residuals increases when there are extreme observations. There is a slightly significant autocorrelation at lag 16, but it is not highly significant. Also, the distribution of the residuals is more concentrated around zero than a normal distribution.
Finally, the decomposition and ARIMA results are combined. Fitted and real values are plotted below.
ggplot(mont, aes(x = event_date))+
geom_line(aes(y = Model2), color = 'red')+
geom_line(aes(y = sold_count), color = 'black')
In the analysis of the residuals, we see that there are some outlier observations where the errors get larger. There is also some remaining seasonality in the residuals, although for most of the periods they stay close to zero. Finally, there is no significant autocorrelation among the residuals.
ggplot(mont, aes(x = event_date))+
geom_point(aes(y = sold_count - Model2))
acf(mont$sold_count-mont$Model2, na.action = na.pass)
Finally, the two models are tested over a period of one week. The first model, the one without external regressors, performs better as it has a lower WMAPE value.
frame_mont
## n mean sd CV FBias MAPE RMSE MAD
## 1 7 300.1429 53.28986 0.1775483 -0.3699630 0.4056711 120.5562 111.0417
## 2 7 300.1429 53.28986 0.1775483 -0.3818521 0.4186059 123.2360 114.6102
## MADP WMAPE WMAPE2
## 1 0.3699630 0.3699630 0.3699630
## 2 0.3818521 0.3818521 0.3818521
rm(list = ls())
accu = function(actual, forecast){
  n = length(actual)
  error = actual - forecast
  mean = mean(actual)
  sd = sd(actual)
  CV = sd / mean
  FBias = sum(error) / sum(actual)               # forecast bias
  MAPE = sum(abs(error / actual)) / n
  RMSE = sqrt(sum(error^2) / n)
  MAD = sum(abs(error)) / n
  MADP = sum(abs(error)) / sum(abs(actual))
  WMAPE = sum((abs(error) / actual) * actual) / sum(actual)  # algebraically equal to MADP
  # WMAPE = MAD / mean
  l = data.frame(n, mean, sd, CV, FBias, MAPE, RMSE, MAD, MADP, WMAPE)
  return(l)
}
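As a quick sanity check with toy numbers (not taken from the data):

```r
accu(actual = c(10, 20, 30), forecast = c(12, 18, 33))
# errors are -2, 2, -3, so MAD = 7/3 and MADP = WMAPE = 7/60
```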
updated_data <- read.csv("updated_data.csv")
bebek_mendil <- updated_data %>%
mutate(event_date=as.Date(event_date)) %>%
arrange(event_date) %>%
filter(product_content_id==4066298) %>%
as.data.table()
acf(bebek_mendil$sold_count, na.action = na.pass)
pacf(bebek_mendil$sold_count, na.action = na.pass)
There is no clear seasonality.
ggplot(bebek_mendil, aes(x=event_date))+
geom_line(aes(y=sold_count))+
labs(title="Sales of Sleepy over time")
The variance changes over time and increases slightly in the period where the series has an upward trend, so multiplicative decomposition can be used.
bebek_mendil_ts <- ts(bebek_mendil$sold_count, frequency = 7)
bebek_mendil_decomposed <- decompose(bebek_mendil_ts,type = "multiplicative")
plot(bebek_mendil_decomposed)
bebek_mendil_decomposed$random %>%
ur.kpss() %>%
summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.1391
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
weekly_bebek_mendil <- bebek_mendil %>%
group_by(yearweek(event_date)) %>%
summarise(sold_count=mean(sold_count)) %>%
rename(yearweek='yearweek(event_date)')
ggplot(weekly_bebek_mendil, aes(x=yearweek,y=sold_count ))+
geom_line()+
labs(title="Average weekly sales of Sleepy over time")
We can use additive decomposition, because the variance does not increase significantly as the trend increases.
acf(weekly_bebek_mendil$sold_count, na.action = na.pass)
pacf(weekly_bebek_mendil$sold_count, na.action = na.pass)
bebek_mendil_weekly_ts <- ts(weekly_bebek_mendil$sold_count, frequency = 7)
bebek_mendil_decomposed_weekly <- decompose(bebek_mendil_weekly_ts, type = "additive")
plot(bebek_mendil_decomposed_weekly)
bebek_mendil_decomposed_weekly$random %>%
ur.kpss() %>%
summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 3 lags.
##
## Value of test-statistic is: 0.0748
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
monthly_bebek_mendil <- bebek_mendil %>%
group_by(yearmonth(event_date)) %>%
summarise(sold_count=mean(sold_count)) %>%
rename(month='yearmonth(event_date)')
ggplot(monthly_bebek_mendil, aes(x=month,y=sold_count))+
geom_line()+
labs(title="Average monthly sales of Sleepy over time")
There are too few data points; it is not meaningful to decompose at the monthly level.
random_bebek_mendil <- bebek_mendil_decomposed$random
acf(random_bebek_mendil, na.action = na.pass)
pacf(random_bebek_mendil, na.action = na.pass)
Both the ACF and the PACF are slightly sinusoidal, so we should try both AR and MA models.
arima(random_bebek_mendil, order = c(1,0,0))
##
## Call:
## arima(x = random_bebek_mendil, order = c(1, 0, 0))
##
## Coefficients:
## ar1 intercept
## 0.3164 0.9611
## s.e. 0.0476 0.0254
##
## sigma^2 estimated as 0.1198: log likelihood = -141.88, aic = 289.76
arima(random_bebek_mendil, order = c(2,0,0))
##
## Call:
## arima(x = random_bebek_mendil, order = c(2, 0, 0))
##
## Coefficients:
## ar1 ar2 intercept
## 0.4202 -0.3266 0.9609
## s.e. 0.0474 0.0474 0.0181
##
## sigma^2 estimated as 0.1069: log likelihood = -119.44, aic = 246.88
arima(random_bebek_mendil, order = c(3,0,0))
##
## Call:
## arima(x = random_bebek_mendil, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.3463 -0.2321 -0.2240 0.9606
## s.e. 0.0489 0.0505 0.0488 0.0145
##
## sigma^2 estimated as 0.1015: log likelihood = -109.19, aic = 228.39
arima(random_bebek_mendil, order = c(0,0,1))
##
## Call:
## arima(x = random_bebek_mendil, order = c(0, 0, 1))
##
## Coefficients:
## ma1 intercept
## 0.3993 0.9610
## s.e. 0.0423 0.0238
##
## sigma^2 estimated as 0.1144: log likelihood = -132.73, aic = 271.47
arima(random_bebek_mendil, order = c(2,0,1))
##
## Call:
## arima(x = random_bebek_mendil, order = c(2, 0, 1))
##
## Coefficients:
## ar1 ar2 ma1 intercept
## 1.0116 -0.5132 -0.7295 0.9601
## s.e. 0.0571 0.0432 0.0536 0.0085
##
## sigma^2 estimated as 0.09719: log likelihood = -100.71, aic = 211.42
The lowest AIC value is obtained in the model of order (2,0,1).
ggpairs(bebek_mendil[,-c(1,2,5,7,10,12,13)])
category_sold and/or category_favored may be used as regressors.
arima(random_bebek_mendil, xreg = bebek_mendil$category_sold, order=c(2,0,1))
##
## Call:
## arima(x = random_bebek_mendil, order = c(2, 0, 1), xreg = bebek_mendil$category_sold)
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ar2 ma1 intercept bebek_mendil$category_sold
## 0.4693 -0.1437 0.1323 0.6613 2e-04
## s.e. 0.1555 0.0768 0.1426 0.0360 NaN
##
## sigma^2 estimated as 0.07661: log likelihood = -53.42, aic = 118.83
auto.arima(random_bebek_mendil, xreg = bebek_mendil$category_sold)
## Series: random_bebek_mendil
## Regression with ARIMA(3,1,1)(0,0,1)[7] errors
##
## Coefficients:
## ar1 ar2 ar3 ma1 sma1 drift xreg
## 0.5852 -0.2582 0.0443 -0.9785 0.0995 -1e-03 2e-04
## s.e. 0.0738 0.0568 0.0617 0.0166 0.0519 3e-04 0e+00
##
## sigma^2 estimated as 0.07515: log likelihood=-47.03
## AIC=110.05 AICc=110.43 BIC=141.88
auto.arima(random_bebek_mendil, xreg = bebek_mendil$category_favored)
## Series: random_bebek_mendil
## Regression with ARIMA(2,1,1)(1,0,0)[7] errors
##
## Coefficients:
## ar1 ar2 ma1 sar1 drift xreg
## 0.4699 -0.2304 -0.9619 0.1001 -4e-04 0
## s.e. 0.0560 0.0554 0.0219 0.0522 2e-04 0
##
## sigma^2 estimated as 0.0903: log likelihood=-83.65
## AIC=181.3 AICc=181.59 BIC=209.15
auto.arima(random_bebek_mendil, xreg = cbind(bebek_mendil$category_favored,bebek_mendil$category_sold))
## Series: random_bebek_mendil
## Regression with ARIMA(3,1,1)(0,0,1)[7] errors
##
## Coefficients:
## ar1 ar2 ar3 ma1 sma1 xreg1 xreg2
## 0.6001 -0.2531 0.0495 -0.9715 0.0975 0 2e-04
## s.e. 0.0358 0.0579 0.0486 0.0151 0.0517 0 1e-04
##
## sigma^2 estimated as 0.07553: log likelihood=-47.85
## AIC=111.7 AICc=112.08 BIC=143.53
The model with the lowest AIC is obtained with order (3,1,1), seasonal order (0,0,1), and the external regressor category_sold. It is also observed that the external regressor has a significant impact on the AIC value: the AIC of the order-(2,0,1) model decreases from 211.42 to 118.83.
test_dates <- seq(as.Date("2021-06-24"), as.Date("2021-06-30"), by = "day")
for(i in 1:length(test_dates)){
  current_date = test_dates[i] - 2
  past_data <- bebek_mendil[event_date <= current_date, ]
  bebek_mendil_ts <- ts(past_data$sold_count, frequency = 7)
  bebek_mendil_decomposed <- decompose(bebek_mendil_ts, type = "multiplicative")
  model <- arima(bebek_mendil_decomposed$random, order = c(3, 1, 1), seasonal = c(0, 0, 1),
                 xreg = past_data$category_sold)
  forecasted = predict(model, n.ahead = 2,
                       newxreg = bebek_mendil[event_date == test_dates[i], "category_sold"])
  # combine the 2-step-ahead random forecast with the matching seasonal factor
  # and the last available trend estimate
  row_i <- nrow(bebek_mendil) - length(test_dates) + i
  bebek_mendil[row_i, Model1 := forecasted$pred[2] *
                 bebek_mendil_decomposed$seasonal[row_i %% 7 + 7] *
                 bebek_mendil_decomposed$trend[max(which(!is.na(bebek_mendil_decomposed$trend)))]]
}
test_idx <- (nrow(bebek_mendil) + 1 - length(test_dates)):nrow(bebek_mendil)
m <- accu(bebek_mendil$sold_count[test_idx], bebek_mendil$Model1[test_idx])
m
## n mean sd CV FBias MAPE RMSE MAD
## 1 7 525.7143 140.9902 0.2681879 -0.2874911 0.3959773 248.2186 164.2288
## MADP WMAPE
## 1 0.3123917 0.3123917
ggplot(bebek_mendil,aes(x=event_date))+
geom_line(aes(y=Model1), color="red")+
xlim(as.Date("2021-06-24"),as.Date("2021-06-30"))+ #Update here
geom_line(aes(y=sold_count))
dis_fircasi <- updated_data %>%
mutate(event_date=as.Date(event_date)) %>%
arrange(event_date) %>%
filter(product_content_id==32939029) %>%
as.data.table()
acf(dis_fircasi$sold_count, na.action = na.pass)
pacf(dis_fircasi$sold_count, na.action = na.pass)
We see relatively high autocorrelation at lags 6 and 7. I set frequency = 7 instead of 6, although the autocorrelation at lag 6 is slightly higher, because weekly seasonality is more meaningful.
ggplot(dis_fircasi, aes(x=event_date))+
geom_line(aes(y=sold_count))+
labs(title="Sales of Oral-B over time")
The variance changes over time and increases slightly in the period where the series has an upward trend. That is why I decided to use multiplicative decomposition, but additive decomposition is also a valid option.
dis_fircasi_ts <- ts(dis_fircasi$sold_count, frequency = 7)
dis_fircasi_decomposed <- decompose(dis_fircasi_ts,type = "multiplicative")
plot(dis_fircasi_decomposed)
dis_fircasi_decomposed$random %>%
ur.kpss() %>%
summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0795
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
weekly_dis_fircasi <- dis_fircasi %>%
group_by(yearweek(event_date)) %>%
summarise(sold_count=mean(sold_count)) %>%
rename(yearweek='yearweek(event_date)')
ggplot(weekly_dis_fircasi, aes(x=yearweek,y=sold_count ))+
geom_line()+
labs(title="Average weekly sales of Oral-B over time")
The variance increases as the trend increases, so a multiplicative model can be used.
acf(weekly_dis_fircasi$sold_count, na.action = na.pass)
pacf(weekly_dis_fircasi$sold_count, na.action = na.pass)
dis_fircasi_weekly_ts <- ts(weekly_dis_fircasi$sold_count, frequency = 7)
dis_fircasi_decomposed_weekly <- decompose(dis_fircasi_weekly_ts, type = "multiplicative")
plot(dis_fircasi_decomposed_weekly)
monthly_dis_fircasi <- dis_fircasi %>%
group_by(yearmonth(event_date)) %>%
summarise(sold_count=mean(sold_count)) %>%
rename(month='yearmonth(event_date)')
ggplot(monthly_dis_fircasi, aes(x=month,y=sold_count))+
geom_line()+
labs(title="Average monthly sales of Oral-B over time")
There are too few data points; it is not meaningful to decompose at the monthly level.
dis_fircasi_random <- dis_fircasi_decomposed$random
acf(dis_fircasi_random, na.action = na.pass)
pacf(dis_fircasi_random, na.action = na.pass)
The ACF is slightly sinusoidal and the PACF is messy, so we can try both autoregressive and moving average models.
arima(dis_fircasi_random, order=c(1,0,0))
##
## Call:
## arima(x = dis_fircasi_random, order = c(1, 0, 0))
##
## Coefficients:
## ar1 intercept
## 0.2547 0.9587
## s.e. 0.0492 0.0269
##
## sigma^2 estimated as 0.1558: log likelihood = -189.45, aic = 384.91
arima(dis_fircasi_random, order=c(2,0,0))
##
## Call:
## arima(x = dis_fircasi_random, order = c(2, 0, 0))
##
## Coefficients:
## ar1 ar2 intercept
## 0.3261 -0.2794 0.9586
## s.e. 0.0489 0.0488 0.0202
##
## sigma^2 estimated as 0.1436: log likelihood = -173.74, aic = 355.48
arima(dis_fircasi_random, order=c(1,0,1))
##
## Call:
## arima(x = dis_fircasi_random, order = c(1, 0, 1))
##
## Coefficients:
## ar1 ma1 intercept
## -0.0232 0.3388 0.9587
## s.e. 0.1062 0.0926 0.0259
##
## sigma^2 estimated as 0.1515: log likelihood = -184.07, aic = 376.14
arima(dis_fircasi_random, order=c(2,0,1))
##
## Call:
## arima(x = dis_fircasi_random, order = c(2, 0, 1))
##
## Coefficients:
## ar1 ar2 ma1 intercept
## 0.9412 -0.4583 -0.7306 0.9593
## s.e. 0.0605 0.0457 0.0540 0.0096
##
## sigma^2 estimated as 0.1299: log likelihood = -154.54, aic = 319.08
arima(dis_fircasi_random, order=c(3,0,2))
##
## Call:
## arima(x = dis_fircasi_random, order = c(3, 0, 2))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 intercept
## 0.0225 0.3997 -0.4647 0.1809 -0.6233 0.9593
## s.e. 0.0883 0.0849 0.0481 0.0916 0.0839 0.0098
##
## sigma^2 estimated as 0.1276: log likelihood = -151.17, aic = 316.34
The ARIMA model of order (3,0,2) performed best; it has the lowest AIC value among these models.
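The manual order comparison above can also be automated with a small grid search over candidate (p, q) orders, keeping the fit with the lowest AIC. A minimal sketch on a simulated ARMA series (the series and the order range here are illustrative, not the homework data):

```r
set.seed(1)

# Simulate a stationary ARMA(2,1) series to stand in for the random component
x <- arima.sim(model = list(ar = c(0.5, -0.3), ma = 0.4), n = 300)

# Fit every (p, 0, q) combination up to order 3 and record the AIC
grid <- expand.grid(p = 0:3, q = 0:3)
grid$aic <- apply(grid, 1, function(g) {
  fit <- tryCatch(arima(x, order = c(g["p"], 0, g["q"])),
                  error = function(e) NULL)
  if (is.null(fit)) NA else AIC(fit)
})

# Best model = lowest AIC among the candidates that converged
best <- grid[which.min(grid$aic), ]
print(best)
```

The same loop applied to `dis_fircasi_random` would reproduce the comparison done by hand above; note that AIC only ranks models on the same series, it says nothing about absolute fit.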
ggpairs(dis_fircasi[,-c(1,2,5,7,10,12,13)])
It is not surprising that basket_count is the variable most correlated with sold_count. But I will use category_favored as the regressor, because I believe it is a more stable variable.
arima(dis_fircasi_random, order=c(3,0,2), xreg = dis_fircasi$category_favored)
##
## Call:
## arima(x = dis_fircasi_random, order = c(3, 0, 2), xreg = dis_fircasi$category_favored)
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ar2 ar3 ma1 ma2 intercept
## -0.0385 0.3515 -0.4465 0.2391 -0.5591 0.8975
## s.e. 0.0903 0.0882 0.0470 0.0932 0.0817 0.0100
## dis_fircasi$category_favored
## 0
## s.e. NaN
##
## sigma^2 estimated as 0.125: log likelihood = -147.01, aic = 310.01
The model improved only slightly when the regressor was added.
auto.arima(dis_fircasi_random, xreg = dis_fircasi$category_favored)
## Series: dis_fircasi_random
## Regression with ARIMA(0,0,1)(1,0,0)[7] errors
##
## Coefficients:
## ma1 sar1 intercept xreg
## 0.2909 0.0014 0.8287 0e+00
## s.e. 0.0557 0.0643 0.0575 1e-04
##
## sigma^2 estimated as 0.1454: log likelihood=-174.03
## AIC=358.06 AICc=358.22 BIC=377.86
auto.arima does not work well in this case.
test_dates <- seq(as.Date("2021-06-24"), as.Date("2021-06-30"), by = "day")
for(i in 1:length(test_dates)){
current_date=test_dates[i]-2
past_data <- dis_fircasi[event_date<=current_date,]
dis_fircasi_ts <- ts(past_data$sold_count, frequency = 7)
dis_fircasi_decomposed <- decompose(dis_fircasi_ts,type = "multiplicative")
model <- arima(dis_fircasi_decomposed$random,order = c(3,0,2), xreg=past_data$category_favored)
forecasted=predict(model,n.ahead = 2,newxreg = dis_fircasi[event_date==test_dates[i],"category_favored"])
dis_fircasi[nrow(dis_fircasi)-length(test_dates)+i,Model1:=forecasted$pred[2]*dis_fircasi_decomposed$seasonal[(nrow(dis_fircasi)-length(test_dates)+i)%%7+7]*dis_fircasi_decomposed$trend[max(which(!is.na(dis_fircasi_decomposed$trend)))]]
}
s<-accu(dis_fircasi$sold_count[(nrow(dis_fircasi)+1-length(test_dates)):(nrow(dis_fircasi))],dis_fircasi$Model1[(nrow(dis_fircasi)+1-length(test_dates)):(nrow(dis_fircasi))])
s
## n mean sd CV FBias MAPE RMSE MAD
## 1 7 72.85714 42.85024 0.5881405 -0.08780776 0.7994462 54.80752 49.75189
## MADP WMAPE
## 1 0.6828691 0.6828691
ggplot(dis_fircasi,aes(x=event_date))+
geom_line(aes(y=Model1), color="red")+
xlim(as.Date("2021-06-24"),as.Date("2021-06-30"))+
geom_line(aes(y=sold_count))
yuz_temizleyici <- updated_data %>%
mutate(event_date=as.Date(event_date)) %>%
arrange(event_date) %>%
filter(product_content_id==85004) %>%
as.data.table()
acf(yuz_temizleyici$sold_count, na.action = na.pass)
pacf(yuz_temizleyici$sold_count, na.action = na.pass)
Interestingly, we see a relatively high autocorrelation at lag 15.
The data show only weak seasonality, but I set frequency = 15 because the autocorrelation is relatively high at lag 15.
ggplot(yuz_temizleyici, aes(x=event_date))+
geom_line(aes(y=sold_count))+
labs(title="Sales of La Roche over time")
We see a slightly increasing trend, and the variance is higher than in earlier periods, apart from the peaks in November 2020 and January 2021. Multiplicative decomposition is therefore more suitable for this series.
yuz_temizleyici_ts <- ts(yuz_temizleyici$sold_count, frequency = 15)
yuz_temizleyici_decomposed <- decompose(yuz_temizleyici_ts,type = "multiplicative")
plot(yuz_temizleyici_decomposed)
weekly_yuz_temizleyici <- yuz_temizleyici %>%
group_by(yearweek(event_date)) %>%
summarise(sold_count=mean(sold_count)) %>%
rename(yearweek='yearweek(event_date)')
ggplot(weekly_yuz_temizleyici, aes(x=yearweek,y=sold_count ))+
geom_line()+
labs(title="Average weekly sales of La Roche over time")
The variance increases only slightly with the trend, so additive decomposition may be used.
acf(weekly_yuz_temizleyici$sold_count, na.action = na.pass)
pacf(weekly_yuz_temizleyici$sold_count, na.action = na.pass)
We should try ARMA models.
yuz_temizleyici_weekly_ts <- ts(weekly_yuz_temizleyici$sold_count, frequency = 9)
yuz_temizleyici_decomposed_weekly <- decompose(yuz_temizleyici_weekly_ts, type = "additive")
plot(yuz_temizleyici_decomposed_weekly)
monthly_yuz_temizleyici <- yuz_temizleyici %>%
group_by(yearmonth(event_date)) %>%
summarise(sold_count=mean(sold_count)) %>%
rename(month='yearmonth(event_date)')
ggplot(monthly_yuz_temizleyici, aes(x=month,y=sold_count))+
geom_line()+
labs(title="Average monthly sales of La Roche over time")
There are too few data points at the monthly level, so decomposition is not meaningful.
yuz_temizleyici_random <- yuz_temizleyici_decomposed$random
arima(yuz_temizleyici_random, order=c(1,0,0))
##
## Call:
## arima(x = yuz_temizleyici_random, order = c(1, 0, 0))
##
## Coefficients:
## ar1 intercept
## 0.5466 0.9896
## s.e. 0.0424 0.0476
##
## sigma^2 estimated as 0.1819: log likelihood = -220.1, aic = 446.21
arima(yuz_temizleyici_random, order=c(0,0,1))
##
## Call:
## arima(x = yuz_temizleyici_random, order = c(0, 0, 1))
##
## Coefficients:
## ma1 intercept
## 0.4327 0.9901
## s.e. 0.0363 0.0324
##
## sigma^2 estimated as 0.1988: log likelihood = -237.25, aic = 480.51
arima(yuz_temizleyici_random, order=c(2,0,0))
##
## Call:
## arima(x = yuz_temizleyici_random, order = c(2, 0, 0))
##
## Coefficients:
## ar1 ar2 intercept
## 0.5880 -0.0756 0.9899
## s.e. 0.0505 0.0505 0.0442
##
## sigma^2 estimated as 0.1809: log likelihood = -218.99, aic = 445.98
arima(yuz_temizleyici_random, order=c(1,0,2))
##
## Call:
## arima(x = yuz_temizleyici_random, order = c(1, 0, 2))
##
## Coefficients:
## ar1 ma1 ma2 intercept
## 0.2936 0.2565 0.3036 0.9896
## s.e. 0.0905 0.0847 0.0598 0.0465
##
## sigma^2 estimated as 0.1725: log likelihood = -209.85, aic = 429.7
auto.arima(yuz_temizleyici_random)
## Series: yuz_temizleyici_random
## ARIMA(2,0,3) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 mean
## 1.4414 -0.6675 -1.0483 0.3036 -0.1989 0.9882
## s.e. 0.0758 0.0672 0.0857 0.0765 0.0722 0.0050
##
## sigma^2 estimated as 0.1443: log likelihood=-172.91
## AIC=359.82 AICc=360.12 BIC=387.55
The model of order (2,0,3) performed better.
ggpairs(yuz_temizleyici[,-c(1,2,5,7,10,12,13)])
Naturally, basket count is the variable most correlated with sold count. But for forecasting, I believe category favored is a better regressor: I expect its variability to be smaller, since it depends on interest in a whole category of products rather than in a specific product.
arima(yuz_temizleyici_random, order=c(2,0,3), xreg = yuz_temizleyici$category_favored)
##
## Call:
## arima(x = yuz_temizleyici_random, order = c(2, 0, 3), xreg = yuz_temizleyici$category_favored)
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 intercept
## 1.3827 -0.3877 -0.8900 0.0884 -0.1544 0.1982
## s.e. 0.1541 0.1526 0.1505 0.0998 0.0763 0.0361
## yuz_temizleyici$category_favored
## 0
## s.e. NaN
##
## sigma^2 estimated as 0.1161: log likelihood = -133.26, aic = 282.51
Adding external regressors decreased AIC significantly.
auto.arima(yuz_temizleyici_random, xreg = yuz_temizleyici$category_favored)
## Series: yuz_temizleyici_random
## Regression with ARIMA(0,1,1)(0,0,1)[15] errors
##
## Coefficients:
## ma1 sma1 drift xreg
## -0.4189 0.0700 -1e-04 0
## s.e. 0.0599 0.0507 0e+00 0
##
## sigma^2 estimated as 0.1341: log likelihood=-158.41
## AIC=326.82 AICc=326.98 BIC=346.61
auto.arima does not work well in this case.
We selected our test period as “2021-06-24”-“2021-06-30”.
test_dates <- seq(as.Date("2021-06-24"), as.Date("2021-06-30"), by = "day")
for(i in 1:length(test_dates)){
current_date=test_dates[i]-2
past_data <- yuz_temizleyici[event_date<=current_date,]
yuz_temizleyici_ts <- ts(past_data$sold_count, frequency = 15)
yuz_temizleyici_decomposed <- decompose(yuz_temizleyici_ts,type = "multiplicative")
model <- arima(yuz_temizleyici_decomposed$random,order = c(2,0,3), xreg=past_data$category_favored)
forecasted=predict(model,n.ahead = 2,newxreg = yuz_temizleyici[event_date==test_dates[i],"category_favored"])
yuz_temizleyici[nrow(yuz_temizleyici)-length(test_dates)+i,Model1:=forecasted$pred[2]*yuz_temizleyici_decomposed$seasonal[(nrow(yuz_temizleyici)-length(test_dates)+i)%%15+15]*yuz_temizleyici_decomposed$trend[max(which(!is.na(yuz_temizleyici_decomposed$trend)))]]
}
t<-accu(yuz_temizleyici$sold_count[(nrow(yuz_temizleyici)+1-length(test_dates)):(nrow(yuz_temizleyici))],yuz_temizleyici$Model1[(nrow(yuz_temizleyici)+1-length(test_dates)):(nrow(yuz_temizleyici))])
t
## n mean sd CV FBias MAPE RMSE MAD MADP
## 1 7 71.28571 26.4305 0.3707686 0.233834 0.1771441 27.04648 16.67718 0.2339485
## WMAPE
## 1 0.2339485
ggplot(yuz_temizleyici,aes(x=event_date))+
geom_line(aes(y=Model1), color="red")+
xlim(as.Date("2021-06-24"),as.Date("2021-06-30"))+
geom_line(aes(y=sold_count))
## Parsed with column specification:
## cols(
## event_date = col_date(format = ""),
## product_content_id = col_double(),
## price = col_double(),
## sold_count = col_double(),
## visit_count = col_double(),
## basket_count = col_double(),
## favored_count = col_double(),
## category_sold = col_double(),
## category_visits = col_double(),
## category_basket = col_double(),
## category_favored = col_double(),
## category_brand_sold = col_double(),
## ty_visits = col_double()
## )
raw_data <- data.table(raw_data)
raw_data[, "event_date" := as.Date(event_date)]
raw_data <- raw_data[event_date >= '2020-05-25']
raw_data[, trend := 1:.N]
raw_data[, month := month(event_date, label = TRUE)]
raw_data[, wday := wday(event_date, label = TRUE)]
raw_data <- raw_data[order(event_date),]
The leopard skin bikini is the next product. To analyze the data we read it first.
data_leopar <- raw_data[product_content_id == 73318567] #leopar
To understand the trend, the time series graph of the number of sold items should be examined.
ggplot(data_leopar, aes(x=event_date)) +
geom_line(aes(y = sold_count), color = "red") + ggtitle("Trend of Leopard Skin Bikini Sales") + xlab("Date") + ylab("Sales")
There are lots of missing points in the data. As we do not have a full year of data, we cannot check for a monthly pattern. We will check for a daily pattern and look at the autocorrelation plot to decide the lag.
acf(data_leopar$sold_count, main= "Daily Autocorrelation")
pacf(data_leopar$sold_count, main= "Daily Partial Autocorrelation")
The data suggest a trend, but no specific lag stands out. The autocorrelation plot above includes the N/A values; for a better understanding, we should also check the data after removing the N/A's.
Data with clearing:
leopar <- na.omit(data_leopar)
ggplot(leopar, aes(x=event_date)) +
geom_line(aes(y = sold_count), color = "red") + ggtitle("Trend of Leopard Skin Bikini Sales N/A Omitted") + xlab("Date") + ylab("Sales")
Checking the autocorrelation and partial autocorrelation:
acf(leopar$sold_count, main= "Daily Autocorrelation")
pacf(leopar$sold_count, main= "Daily Partial Autocorrelation")
The autocorrelation plot yields very similar results to the N/A-included version.
Although we do not see a peak at day seven, a weekly pattern is plausible, so we decompose the daily data with frequency 7. Since the variance changes over time, we first fit a multiplicative decomposition and, based on the results, also an additive one.
To keep track of the days of the week, we use the N/A-included version so as not to lose any data points.
leoparts <- ts(data_leopar$sold_count,freq=7)
data_mult<-decompose(leoparts,type="multiplicative")
random=data_mult$random
plot(data_mult)
The trend increases as summer arrives, but there is a sharp decrease in June due to the stock-out of smaller sizes. The random component looks fairly stationary, with nearly stable mean and variance except for two outlier peaks.
In this plot it is hard to identify the daily seasonal pattern, so we need a closer look.
plot(data_mult$seasonal[1:7], xlab = "Day", ylab = "Multiplicative factor", main = "Seasonal Component for Multiplicative Decomposition")
mean_sold=data_leopar[,list(m_sold = mean(sold_count,na.rm=T)), by="wday"]
mean_sold
## wday m_sold
## 1: Mon 16.10345
## 2: Tue 17.22414
## 3: Wed 18.56897
## 4: Thu 18.41379
## 5: Fri 17.22807
## 6: Sat 20.78947
## 7: Sun 20.50877
To interpret the multiplicative decomposition, we examine the multipliers of the days and compare them with the mean sales of each day. Saturdays are the maximum, and the seasonal component reflects this. However, although the mean is also very high on Sundays, the Sunday multiplier is low.
Higher sales on weekends are normal, as people have more time to shop.
To check the stationarity of the random component, we use the KPSS test.
unt_test=ur.kpss(data_mult$random)
summary(unt_test)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 4 lags.
##
## Value of test-statistic is: 0.0863
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The random component is stationary: the test statistic of 0.0863 is below even the 10% critical value of 0.347, so the p-value is larger than 10% and we fail to reject the null hypothesis of stationarity.
Since the multipliers of the multiplicative model were not satisfactory, we also check the additive decomposition.
leoparts <- ts(data_leopar$sold_count,freq=7)
data_add<-decompose(leoparts,type="additive")
random=data_add$random
plot(data_add)
The trend is very similar to the multiplicative version. In the random component, the outliers seem to have disappeared, so the additive model's remainder looks more stationary.
In this plot it is hard to identify the daily seasonal pattern, so we need a closer look.
plot(data_add$seasonal[1:7], xlab = "Day", ylab = "Additive day effect", main = "Seasonal Component for Additive Decomposition")
mean_sold=data_leopar[,list(m_sold = mean(sold_count,na.rm=T)), by="wday"]
mean_sold
## wday m_sold
## 1: Mon 16.10345
## 2: Tue 17.22414
## 3: Wed 18.56897
## 4: Thu 18.41379
## 5: Fri 17.22807
## 6: Sat 20.78947
## 7: Sun 20.50877
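The link between the weekday means above and the additive seasonal component can be checked directly: subtracting the overall mean from each weekday mean gives, up to the moving-average trend estimate, the additive day effects that `decompose()` reports. A small base-R sketch with invented sales figures (the numbers are illustrative, not the bikini data):

```r
# Hypothetical daily sales over 4 full weeks, Monday first, with a weekly
# pattern plus a small week-to-week shift
sales <- rep(c(16, 17, 18, 18, 17, 21, 20), times = 4) + rep(c(0, 1, -1, 0), each = 7)
wday  <- rep(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"), times = 4)

# Additive day effect = weekday mean minus overall mean
day_means  <- tapply(sales, wday, mean)
day_effect <- day_means - mean(sales)

# The effects sum to (numerically) zero, just like decompose()'s
# additive seasonal component over one period
round(sum(day_effect), 10)
```

Here the Saturday effect is the largest, matching the highest weekday mean; in the real decomposition the trend filter makes the correspondence approximate rather than exact.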
In the additive model, the day effects closely follow the mean sales of each day: the high weekend means correspond to high seasonal components.
To check the stationarity of the random component, we use the KPSS test again.
unt_test=ur.kpss(data_add$random)
summary(unt_test)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0074
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The random component is stationary: the test statistic of 0.0074 is below even the 10% critical value of 0.347, so we fail to reject the null hypothesis of stationarity. This statistic is also smaller than that of the multiplicative decomposition, so I will continue the analysis with the additive model.
acf(random, na.action = na.pass, main= "Detrend's Autocorrelation")
pacf(random, na.action = na.pass, main= "Detrend's Partial Autocorrelation")
The autocorrelation plot decays, and there is a significant partial autocorrelation at lag 5. That is why AR(5) will be tried first.
model <- arima(random, order=c(5,0,0))
print(model)
##
## Call:
## arima(x = random, order = c(5, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 intercept
## 0.0361 -0.2422 -0.2045 -0.1282 -0.2927 0.0147
## s.e. 0.0479 0.0475 0.0479 0.0474 0.0478 0.2627
##
## sigma^2 estimated as 91.12: log likelihood = -1459.34, aic = 2932.67
The partial autocorrelation plot also appears to decay, and there is significant autocorrelation at lag 5 in the ACF plot. Let's also try MA(5).
model <- arima(random, order=c(0,0,5))
print(model)
##
## Call:
## arima(x = random, order = c(0, 0, 5))
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 intercept
## -0.0828 -0.3490 -0.2766 -0.1211 -0.1705 0.0015
## s.e. 0.0491 0.0545 0.0567 0.0481 0.0603 0.0117
##
## sigma^2 estimated as 83.58: log likelihood = -1444.24, aic = 2902.47
As its AIC is lower than the first model's, this model is better.
Next we combine them and try neighbouring orders to decide which model to use.
model <- arima(random, order=c(5,0,5))
print(model)
##
## Call:
## arima(x = random, order = c(5, 0, 5))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.1543 -0.3693 -0.4067 0.5051 -0.3374 -0.2525 0.0365 0.1156
## s.e. 0.1350 0.1064 0.0822 0.1241 0.0929 0.1424 0.1259 0.0791
## ma4 ma5 intercept
## -0.8388 -0.0609 0.0017
## s.e. 0.1260 0.1397 0.0093
##
## sigma^2 estimated as 79.22: log likelihood = -1434.22, aic = 2892.43
This is the lowest AIC so far, thus the best model. Trying the neighbours:
modelf <- arima(random, order=c(5,0,4))
print(modelf)
##
## Call:
## arima(x = random, order = c(5, 0, 5))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.1543 -0.3693 -0.4067 0.5051 -0.3374 -0.2525 0.0365 0.1156
## s.e. 0.1350 0.1064 0.0822 0.1241 0.0929 0.1424 0.1259 0.0791
## ma4 ma5 intercept
## -0.8388 -0.0609 0.0017
## s.e. 0.1260 0.1397 0.0093
##
## sigma^2 estimated as 79.22: log likelihood = -1434.22, aic = 2892.43
Moving to MA(4) increased the performance. Checking AR(4) as well:
model <- arima(random, order=c(4,0,4))
print(model)
##
## Call:
## arima(x = random, order = c(4, 0, 4))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 ma2 ma3 ma4
## -0.0991 -0.0160 0.5042 -0.4244 0.0115 -0.3685 -0.9035 0.2605
## s.e. 0.1563 0.0888 0.0489 0.0891 0.1674 0.0512 0.0583 0.1523
## intercept
## 0.0015
## s.e. 0.0091
##
## sigma^2 estimated as 81.91: log likelihood = -1440.96, aic = 2901.91
This lowered the performance, so we choose (5,0,4) for the ARIMA models.
To identify possible regressors for the ARIMA model, the correlations with both the detrended data and sold_count are important.
Then, lagged versions of the variables will be added, since for prediction we need the regressor values to be available in advance.
leopar <- data_leopar
leopar <- leopar[, random := data_add$random]
numeric_data <- leopar
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame': 119 obs. of 12 variables:
## $ price : num 60 60 60 60 60 ...
## $ sold_count : num 1 2 2 2 1 1 5 3 6 1 ...
## $ visit_count : num 0 0 0 0 0 261 306 343 334 278 ...
## $ basket_count : num 4 11 13 7 2 7 25 28 35 17 ...
## $ favored_count : num 73 62 110 45 37 49 71 76 45 51 ...
## $ category_sold : num 124 156 219 148 134 175 457 372 360 229 ...
## $ category_visits : num 389 495 453 391 352 458 781 608 588 350 ...
## $ category_basket : num 0 0 0 0 0 ...
## $ category_favored : num 3246 3956 3427 2601 2167 ...
## $ category_brand_sold: num 14526 13797 20377 8414 7112 ...
## $ ty_visits : num 1 1 1 1 1 ...
## $ random : num -1.9975 -0.9524 3.0626 0.6528 0.0451 ...
## - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)
The variables most correlated with the data are basket_count, visit_count and favored_count, so we add their lagged versions and examine their correlations as well. Price does not show a clear correlation, as the data here do not reflect every discount.
leopar <- leopar[,favored_count_lag := shift(favored_count, 2)]
leopar <- leopar[,basket_count_lag := shift(basket_count, 2)]
leopar <- leopar[,visit_count_lag := shift(visit_count, 2)]
numeric_data <- leopar
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame': 119 obs. of 15 variables:
## $ price : num 60 60 60 60 60 ...
## $ sold_count : num 1 2 2 2 1 1 5 3 6 1 ...
## $ visit_count : num 0 0 0 0 0 261 306 343 334 278 ...
## $ basket_count : num 4 11 13 7 2 7 25 28 35 17 ...
## $ favored_count : num 73 62 110 45 37 49 71 76 45 51 ...
## $ category_sold : num 124 156 219 148 134 175 457 372 360 229 ...
## $ category_visits : num 389 495 453 391 352 458 781 608 588 350 ...
## $ category_basket : num 0 0 0 0 0 ...
## $ category_favored : num 3246 3956 3427 2601 2167 ...
## $ category_brand_sold: num 14526 13797 20377 8414 7112 ...
## $ ty_visits : num 1 1 1 1 1 ...
## $ random : num -1.9975 -0.9524 3.0626 0.6528 0.0451 ...
## $ favored_count_lag : num 0 37 73 110 83 37 45 33 71 76 ...
## $ basket_count_lag : num 0 4 4 13 11 2 12 5 25 28 ...
## $ visit_count_lag : num 0 0 0 0 0 0 144 84 306 343 ...
## - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)
The highest correlation is with basket_count_lag, so we try it first.
reg_matrix=cbind( leopar$basket_count_lag) # can add more if any other regressors exist
model1 <- arima(leopar$random, order = c(5,0,4), xreg = reg_matrix)
summary(model1)
##
## Call:
## arima(x = leopar$random, order = c(5, 0, 4), xreg = reg_matrix)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.1882 -0.3865 -0.4217 0.524 -0.3827 -0.3052 0.0533 0.1220
## s.e. 0.0628 NaN 0.0626 NaN 0.0465 0.0386 NaN 0.0575
## ma4 intercept reg_matrix
## -0.8702 -0.0203 4e-04
## s.e. NaN 0.0117 2e-04
##
## sigma^2 estimated as 77.77: log likelihood = -1430.69, aic = 2885.38
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1958726 8.818632 3.977402 -17.58764 268.857 0.6383409
## ACF1
## Training set 0.0008931498
Adding the variable lowered the AIC, so the model improved. To assess performance and compare the two models, let's plot the predicted values against the actual data.
model_fitted <- leopar$random - residuals(model1)
leopar <- cbind(leopar, data_add$seasonal, data_add$trend, model_fitted)
leopar <-leopar[,predict1 := data_add$seasonal + data_add$trend + model_fitted]
model_fitted_2 <- leopar$random - residuals(modelf)
leopar <- cbind(leopar, model_fitted_2)
leopar <-leopar[,predictonlyar := data_add$seasonal + data_add$trend + model_fitted_2]
ggplot(leopar, aes(x=event_date)) +
geom_line(aes(y = sold_count, color = "Sales")) +
geom_line(aes(y = predict1, color="Prediction with ARIMA + Regressors")) +
geom_line(aes(y = predictonlyar, color = "Prediction with ARIMA")) + ggtitle ("Actual vs Predicted Data")
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 6 row(s) containing missing values (geom_path).
Except for the peak points, the predictions seem to catch the trend well. As we can see from the plot, both models yield pretty similar results.
Let's add the variable with the second-highest correlation, visit_count_lag.
reg_matrix=cbind( leopar$basket_count_lag, leopar$visit_count_lag) # can add more if any other regressors exist
model2 <- arima(leopar$random,order = c(5,0,4), xreg = reg_matrix)
summary(model2)
##
## Call:
## arima(x = leopar$random, order = c(5, 0, 4), xreg = reg_matrix)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.1907 -0.3915 -0.4173 0.5104 -0.3844 -0.3121 0.0542 0.1162
## s.e. 0.0737 0.1435 0.0799 0.1112 0.0478 0.0695 0.1511 0.0820
## ma4 intercept reg_matrix1 reg_matrix2
## -0.8583 -0.0099 0.0019 -1e-04
## s.e. 0.1372 0.0031 0.0008 0e+00
##
## sigma^2 estimated as 77.44: log likelihood = -1429.81, aic = 2885.62
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1108262 8.799832 3.973089 140.9859 380.9705 0.6376486
## ACF1
## Training set 0.0003948911
The performance does not increase and the AIC remains essentially the same, so we continue with one regressor. To make predictions and evaluate the model's performance, we use the accuracy function from the lectures.
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
FBias=sum(error)/sum(actual)
MAPE=sum(abs(error/actual))/n
RMSE=sqrt(sum(error^2)/n)
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE)
return(l)
}
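As a quick sanity check of `accu`, here it is applied to a small hypothetical actual/forecast pair (the numbers are invented for illustration; the function is repeated so the snippet is self-contained):

```r
# accu() as defined above, repeated so this snippet runs on its own
accu <- function(actual, forecast){
  n <- length(actual)
  error <- actual - forecast
  mean <- mean(actual)
  sd <- sd(actual)
  CV <- sd / mean
  FBias <- sum(error) / sum(actual)
  MAPE <- sum(abs(error / actual)) / n
  RMSE <- sqrt(sum(error^2) / n)
  MAD <- sum(abs(error)) / n
  MADP <- sum(abs(error)) / sum(abs(actual))
  WMAPE <- MAD / mean
  data.frame(n, mean, sd, CV, FBias, MAPE, RMSE, MAD, MADP, WMAPE)
}

actual   <- c(10, 20, 30, 40)
forecast <- c(12, 18, 33, 37)
accu(actual, forecast)
# MAD = (2+2+3+3)/4 = 2.5, so WMAPE = 2.5/25 = 0.1;
# FBias = 0 because the over- and under-forecasts cancel
```

FBias measures systematic over- or under-forecasting, while WMAPE weights the absolute errors by the mean level, which is why it is the main metric compared across models below.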
We need to separate the test and training data. We will use the last 7 days for prediction. To compare, we will evaluate both the ARIMA with a regressor and the ARIMA alone.
ARIMA with regressor:
train_start=as.Date('2021-01-23')
test_start=as.Date('2021-06-24')
test_end=as.Date('2021-06-30')
test_dates=seq(test_start,test_end,by='day')
for(i in 1:length(test_dates)){
current_date=test_dates[i]-2
past_data <- leopar[event_date<=current_date,]
leopar_ts <- ts(past_data$sold_count, frequency = 7)
leopar_decomposed <- decompose(leopar_ts, type="additive")
model <- arima(leopar_decomposed$random,order = c(5,0,4),xreg = past_data$basket_count_lag)
forecasted=predict(model,n.ahead = 2,newxreg = leopar[event_date==test_dates[i],basket_count_lag])
leopar[nrow(leopar)-length(test_dates)+i-2, Model_reg := forecasted$pred[2]+leopar_decomposed$seasonal[(nrow(leopar)-length(test_dates)+i-2)%%7+7]+leopar_decomposed$trend[max(which(!is.na(leopar_decomposed$trend)))]]
}
m_with_reg<-accu(leopar$sold_count[(nrow(leopar)-1-length(test_dates)):(nrow(leopar)-2)],(leopar$Model_reg[(nrow(leopar)-1-length(test_dates)):(nrow(leopar)-2)]))
ARIMA without regressor:
for(i in 1:length(test_dates)){
current_date=test_dates[i]-2
past_data <- leopar[event_date<=current_date,]
leopar_ts <- ts(past_data$sold_count, frequency = 7)
leopar_decomposed <- decompose(leopar_ts, type="additive")
model <- arima(leopar_decomposed$random,order = c(5,0,4))
forecasted=predict(model,n.ahead = 2)
leopar[nrow(leopar)-length(test_dates)+i-2, Model_nolag := forecasted$pred[2]+leopar_decomposed$seasonal[(nrow(leopar)-length(test_dates)+i-2)%%7+7]+leopar_decomposed$trend[max(which(!is.na(leopar_decomposed$trend)))]]
}
m_only_arima<-accu(leopar$sold_count[(nrow(leopar)-1-length(test_dates)):(nrow(leopar)-2)],(leopar$Model_nolag[(nrow(leopar)-1-length(test_dates)):(nrow(leopar)-2)]))
rbind(m_with_reg,m_only_arima)
## n mean sd CV FBias MAPE RMSE MAD MADP
## 1 7 27 8.485281 0.3142697 -0.2953161 0.4192115 11.12198 9.559140 0.3540422
## 2 7 27 8.485281 0.3142697 -0.2910258 0.4130253 11.02719 9.380608 0.3474299
## WMAPE
## 1 0.3540422
## 2 0.3474299
When we compare the results, the ARIMA without regressors seems to perform slightly better, but the difference is not significant.
The black bikini is the next product. To analyze the data we read it first.
data_siyah <- raw_data[product_content_id == 32737302]
To understand the trend, the time series graph of the number of sold items should be examined.
ggplot(data_siyah, aes(x=event_date)) +
geom_line(aes(y = sold_count), color = "red") + ggtitle("Trend of Black Bikini Sales") + xlab("Date") + ylab("Sales")
There are lots of missing points in the data. As we do not have a full year of data, we cannot check for a monthly pattern. We will check for a daily pattern and examine the autocorrelation plot.
acf(data_siyah$sold_count, main= "Daily Autocorrelation")
siyah <- data_siyah
The data suggest a trend, but no specific lag stands out. We do not omit the N/A's, so as not to lose any information.
Although we do not see a peak at day seven, a weekly pattern is plausible, so we decompose the daily data with frequency 7. Since the variance changes over time, we first fit a multiplicative decomposition.
siyahts <- ts(siyah$sold_count,freq=7)
data_mult<-decompose(siyahts,type="multiplicative")
random=data_mult$random
plot(data_mult)
The trend increases as summer arrives, but there is a sharp decrease in June due to the stock-out of smaller sizes. The random component has many peaks, which may cause non-stationarity; apart from the peaks, however, the mean and variance seem stable, so it looks stationary.
In this plot it is hard to identify the daily seasonal pattern, so we need a closer look.
plot(data_mult$seasonal[1:7], xlab = "Day", ylab = "Multiplicative factor", main = "Seasonal Component for Multiplicative Decomposition")
To interpret the multiplicative decomposition, we examine the day multipliers and compare them with the mean sales of each day. Tuesdays are the maximum, and the seasonal component reflects this. However, although the mean is also very high on Wednesdays, the Wednesday multiplier is low.
Higher sales early in the week are normal, as people want delivery before the end of the week.
mean_sold=siyah[,list(mean_sold = mean(sold_count,na.rm=T)), by="wday"]
mean_sold
## wday mean_sold
## 1: Mon 14.01724
## 2: Tue 16.55172
## 3: Wed 16.29310
## 4: Thu 15.25862
## 5: Fri 13.64912
## 6: Sat 12.43860
## 7: Sun 15.05263
To check the stationarity of the random component, we use the KPSS test.
unt_test=ur.kpss(data_mult$random)
summary(unt_test)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 4 lags.
##
## Value of test-statistic is: 0.0722
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The random component is stationary: the test statistic of 0.0722 is below even the 10% critical value of 0.347, so we fail to reject the null hypothesis of stationarity.
Since the multipliers of the multiplicative model were not satisfactory, we also check the additive decomposition.
siyahts <- ts(siyah$sold_count,freq=7)
data_add<-decompose(siyahts,type="additive")
random=data_add$random
plot(data_add)
The trend is very similar to the multiplicative version. In the random component, the outliers seem to have disappeared, so the additive model's remainder looks more stationary.
In this plot it is hard to identify the daily seasonal pattern, so we need a closer look.
plot(data_add$seasonal[1:7], xlab = "Day", ylab = "Additive day effect", main = "Seasonal Component for Additive Decomposition")
mean_sold=siyah[,list(m_sold = mean(sold_count,na.rm=T)), by="wday"]
mean_sold
## wday m_sold
## 1: Mon 14.01724
## 2: Tue 16.55172
## 3: Wed 16.29310
## 4: Thu 15.25862
## 5: Fri 13.64912
## 6: Sat 12.43860
## 7: Sun 15.05263
In the additive model, the day effects closely follow the mean sales of each day: the high means on Tuesdays and Wednesdays correspond to high seasonal components.
To check the stationarity of the random component, we use the KPSS test again.
unt_test=ur.kpss(data_add$random)
summary(unt_test)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0063
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The random component is stationary: the test statistic 0.0063 is below even the 10% critical value of 0.347, so we fail to reject the null hypothesis of stationarity. Since this statistic is also smaller than the multiplicative decomposition's, we continue the analysis with the additive model.
acf(random, na.action = na.pass, main= "Detrend's Autocorrelation")
pacf(random, na.action = na.pass, main= "Detrend's Partial Autocorrelation")
The autocorrelation plot decays gradually, and the last significant partial autocorrelation is at lag 6, so AR(6) is tried first.
model <- arima(random, order=c(6,0,0))
print(model)
##
## Call:
## arima(x = random, order = c(6, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 intercept
## 0.0369 -0.3348 -0.4273 -0.1462 -0.208 -0.3156 0.0034
## s.e. 0.0477 0.0465 0.0490 0.0492 0.047 0.0487 0.1135
##
## sigma^2 estimated as 29: log likelihood = -1232.44, aic = 2480.87
The partial autocorrelation plot also decays, and the last significant autocorrelation is at lag 3, so we also try MA(3).
modelf <- arima(random, order=c(0,0,3))
print(modelf)
##
## Call:
## arima(x = random, order = c(0, 0, 3))
##
## Coefficients:
## ma1 ma2 ma3 intercept
## -0.0715 -0.4262 -0.5023 -0.0043
## s.e. 0.0452 0.0376 0.0484 0.0054
##
## sigma^2 estimated as 26.69: log likelihood = -1217.92, aic = 2445.84
As its AIC (2445.84) is lower than that of AR(6) (2480.87), this model is better.
Next, we try their combination and its neighbours to decide which model to use.
model <- arima(random, order=c(6,0,3))
print(model)
##
## Call:
## arima(x = random, order = c(6, 0, 3))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ma1 ma2
## -0.5865 -0.0903 0.0261 -0.4205 -0.1071 -0.1649 0.4962 -0.5314
## s.e. 0.0533 0.0582 0.0576 0.0570 0.0580 0.0536 0.0259 0.0147
## ma3 intercept
## -0.9649 -0.0044
## s.e. 0.0298 0.0031
##
## sigma^2 estimated as 23.33: log likelihood = -1192.87, aic = 2407.74
Although the AIC of the combined model is lower, it is considerably more complex; for parsimony we continue with the MA model and try its neighbours:
model <- arima(random, order=c(0,0,4))
print(model)
##
## Call:
## arima(x = random, order = c(0, 0, 4))
##
## Coefficients:
## ma1 ma2 ma3 ma4 intercept
## -0.1057 -0.4504 -0.5138 0.0699 -0.0042
## s.e. 0.0558 0.0459 0.0475 0.0699 0.0051
##
## sigma^2 estimated as 26.61: log likelihood = -1217.41, aic = 2446.81
model <- arima(random, order=c(0,0,2))
print(model)
##
## Call:
## arima(x = random, order = c(0, 0, 2))
##
## Coefficients:
## ma1 ma2 intercept
## -0.3481 -0.6519 -0.0044
## s.e. 0.0323 0.0319 0.0040
##
## sigma^2 estimated as 31.67: log likelihood = -1251.98, aic = 2511.96
Both neighbours increase the AIC, so we choose ARIMA(0,0,3).
To identify possible regressors for the ARIMA model, we examine the correlations with both the detrended series and sold_count.
Lagged versions of the variables will then be added to the model, since at prediction time the regressor values must already be available.
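The neighbour search above can also be automated as a small AIC grid. A minimal sketch, using a simulated MA(3)-like stand-in for the detrended series so that it runs on its own (the grid bounds are illustrative assumptions, not part of the homework):

```r
# Hypothetical helper: pick (p, q) by AIC over a small grid.
set.seed(1)
random_sim <- arima.sim(model = list(ma = c(-0.1, -0.4, -0.5)), n = 200)
orders <- expand.grid(p = 0:2, q = 0:4)
orders$aic <- apply(orders, 1, function(o) {
  fit <- tryCatch(arima(random_sim, order = c(o[["p"]], 0, o[["q"]])),
                  error = function(e) NULL)  # some fits may fail to converge
  if (is.null(fit)) Inf else AIC(fit)
})
orders[which.min(orders$aic), ]  # best (p, q) pair by AIC
```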
siyah <- siyah[, random := data_add$random]
numeric_data <- siyah
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame': 175 obs. of 12 variables:
## $ price : num 60 60 60 60 60.5 ...
## $ sold_count : num 33 33 39 44 32 32 21 28 25 27 ...
## $ visit_count : num 0 0 0 0 0 0 0 0 0 0 ...
## $ basket_count : num 150 161 169 186 113 117 133 132 106 119 ...
## $ favored_count : num 0 0 0 0 0 0 0 0 0 0 ...
## $ category_sold : num 722 968 1279 1320 992 ...
## $ category_visits : num 1052 1374 1758 1853 1429 ...
## $ category_basket : num 0 0 0 0 0 0 0 0 0 0 ...
## $ category_favored : num 5566 7406 9076 9482 6511 ...
## $ category_brand_sold: num 0 0 0 0 0 0 0 0 0 0 ...
## $ ty_visits : num 1 1 1 1 1 1 1 1 1 1 ...
## $ random : num -5.501 -3.07 5.878 10.035 0.243 ...
## - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)
The variables most correlated with the series are basket_count, visit_count and category_sold. To be able to use them in prediction, we add lagged versions.
Price does not show a meaningful correlation, since the data do not capture all discount types.
siyah <- siyah[,category_sold_lag := shift(category_sold, 2)]
siyah <- siyah[,basket_count_lag := shift(basket_count, 2)]
siyah <- siyah[,visit_count_lag := shift(visit_count, 2)]
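The two-step lag mirrors the forecast horizon used later: to predict day t, only regressor values observed up to day t-2 may be used. A base-R sketch of what `data.table::shift(x, 2)` does (`lag_n` is a hypothetical helper, not from the homework):

```r
# Shift a vector forward by n positions, padding the front with NA.
lag_n <- function(x, n = 2) c(rep(NA_real_, n), head(x, length(x) - n))
lag_n(c(10, 20, 30, 40, 50))  # NA NA 10 20 30
```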
numeric_data <- siyah
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame': 175 obs. of 15 variables:
## $ price : num 60 60 60 60 60.5 ...
## $ sold_count : num 33 33 39 44 32 32 21 28 25 27 ...
## $ visit_count : num 0 0 0 0 0 0 0 0 0 0 ...
## $ basket_count : num 150 161 169 186 113 117 133 132 106 119 ...
## $ favored_count : num 0 0 0 0 0 0 0 0 0 0 ...
## $ category_sold : num 722 968 1279 1320 992 ...
## $ category_visits : num 1052 1374 1758 1853 1429 ...
## $ category_basket : num 0 0 0 0 0 0 0 0 0 0 ...
## $ category_favored : num 5566 7406 9076 9482 6511 ...
## $ category_brand_sold: num 0 0 0 0 0 0 0 0 0 0 ...
## $ ty_visits : num 1 1 1 1 1 1 1 1 1 1 ...
## $ random : num -5.501 -3.07 5.878 10.035 0.243 ...
## $ category_sold_lag : num 1223 848 722 968 1279 ...
## $ basket_count_lag : num 211 141 150 161 169 186 113 117 133 132 ...
## $ visit_count_lag : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)
The correlations of the lagged variables are all very close to each other. We add them one by one to check for improvement.
reg_matrix=cbind( siyah$basket_count_lag) # can add more if any other regressors exist
model1 <- arima(siyah$random,order = c(0,0,3), xreg = reg_matrix)
summary(model1)
##
## Call:
## arima(x = siyah$random, order = c(0, 0, 3), xreg = reg_matrix)
##
## Coefficients:
## ma1 ma2 ma3 intercept reg_matrix
## -0.0719 -0.4263 -0.5018 -0.0091 1e-04
## s.e. 0.0453 0.0376 0.0484 0.0088 2e-04
##
## sigma^2 estimated as 26.66: log likelihood = -1217.67, aic = 2447.35
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.03134201 5.163451 2.973833 63.64687 99.8179 0.616558
## ACF1
## Training set -0.02027125
The AIC does not improve when we add the lagged variable.
model_fitted <- siyah$random - residuals(model1)
siyah <- cbind(siyah, data_add$seasonal, data_add$trend, model_fitted)
siyah <-siyah[,predictarima := data_add$seasonal + data_add$trend + model_fitted ]
model_fitted_2 <- siyah$random - residuals(modelf)
siyah <- cbind(siyah, model_fitted_2)
siyah <-siyah[,predictonlyar := data_add$seasonal + data_add$trend + model_fitted_2]
ggplot(siyah, aes(x=event_date)) +
geom_line(aes(y = sold_count), color = "red") +
geom_line(aes(y = predictarima), color="blue") + ggtitle("Trend of Black Bikini Sales") + xlab("Date") + ylab("Sales")
## Warning: Removed 6 row(s) containing missing values (geom_path).
Adding the second:
reg_matrix=cbind( siyah$visit_count_lag)
model2 <- arima(siyah$random,order = c(0,0,3), xreg = reg_matrix)
summary(model2)
##
## Call:
## arima(x = siyah$random, order = c(0, 0, 3), xreg = reg_matrix)
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ma1 ma2 ma3 intercept reg_matrix
## -0.0719 -0.4263 -0.5018 -0.0092 0
## s.e. 0.0453 0.0376 0.0484 NaN NaN
##
## sigma^2 estimated as 26.66: log likelihood = -1217.63, aic = 2447.26
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.02003195 5.162873 2.971803 63.58047 99.35088 0.6161371
## ACF1
## Training set -0.02029881
The AIC has not decreased, so the model did not improve; we try the last lagged variable to check for any improvement.
reg_matrix=cbind( siyah$category_sold_lag)
model3 <- arima(siyah$random,order = c(0,0,3), xreg = reg_matrix)
summary(model3)
##
## Call:
## arima(x = siyah$random, order = c(0, 0, 3), xreg = reg_matrix)
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ma1 ma2 ma3 intercept reg_matrix
## -0.0715 -0.4263 -0.5022 -0.0094 0
## s.e. 0.0453 0.0375 0.0484 NaN NaN
##
## sigma^2 estimated as 26.67: log likelihood = -1217.75, aic = 2447.5
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.06101857 5.164487 2.985519 63.99305 101.3311 0.6189809
## ACF1
## Training set -0.02013689
The AIC is higher again, so adding any of the regressors worsened the model. Among the regressor models, the one with visit_count_lag has the lowest AIC, so we generate predictions both with it and with the plain ARIMA model and compare the results.
Testing with the regressor:
train_start=as.Date('2021-01-23')
test_start=as.Date('2021-06-24')
test_end=as.Date('2021-06-30')
test_dates=seq(test_start,test_end,by='day')
for(i in 1:length(test_dates)){
  current_date=test_dates[i]-2
  past_data <- siyah[event_date<=current_date,]
  siyah_ts <- ts(past_data$sold_count, frequency = 7)
  siyah_decomposed <- decompose(siyah_ts, type="additive")
  model <- arima(siyah_decomposed$random,order = c(0,0,3),xreg = past_data$visit_count_lag)
  forecasted=predict(model,n.ahead = 2,newxreg = siyah[event_date==test_dates[i],visit_count_lag])
  siyah[nrow(siyah)-length(test_dates)+i-2, Model_reg := forecasted$pred[2]+siyah_decomposed$seasonal[(nrow(siyah)-length(test_dates)+i-2)%%7+7]+siyah_decomposed$trend[max(which(!is.na(siyah_decomposed$trend)))]]
}
m_with_lag<-accu(siyah$sold_count[(nrow(siyah)-1-length(test_dates)):(nrow(siyah)-2)],(siyah$Model_reg[(nrow(siyah)-1-length(test_dates)):(nrow(siyah)-2)]))
Testing with only arima:
for(i in 1:length(test_dates)){
  current_date=test_dates[i]-2
  past_data <- siyah[event_date<=current_date,]
  siyah_ts <- ts(past_data$sold_count, frequency = 7)
  siyah_decomposed <- decompose(siyah_ts, type="additive")
  model <- arima(siyah_decomposed$random,order = c(0,0,3))
  forecasted=predict(model,n.ahead = 2)
  siyah[nrow(siyah)-length(test_dates)+i-2, Model_no_reg := forecasted$pred[2]+siyah_decomposed$seasonal[(nrow(siyah)-length(test_dates)+i-2)%%7+7]+siyah_decomposed$trend[max(which(!is.na(siyah_decomposed$trend)))]]
}
m_without_lag<-accu(siyah$sold_count[(nrow(siyah)-1-length(test_dates)):(nrow(siyah)-2)],(siyah$Model_no_reg[(nrow(siyah)-1-length(test_dates)):(nrow(siyah)-2)]))
The version without the lagged regressor performed slightly better; it is selected, as its AIC was also lower.
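Both test loops above follow the same rolling-origin pattern; stripped to its core (with a simulated series standing in for the product data, and illustrative index choices), it looks like this:

```r
# For each test day t, re-fit on data available up to t-2 and keep the
# second forecast step, i.e. a genuine two-day-ahead prediction.
set.seed(7)
y <- 20 + arima.sim(model = list(ma = c(-0.1, -0.4, -0.5)), n = 180)
h <- 2
test_idx <- 174:180  # seven test days, as in the homework
preds <- sapply(test_idx, function(t) {
  fit <- arima(y[seq_len(t - h)], order = c(0, 0, 3))
  predict(fit, n.ahead = h)$pred[h]
})
round(preds, 2)
```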
The next product is the leggings:
data_tayt <- raw_data[product_content_id == 31515569]
To get a better understanding, we plot the time series of the leggings sales:
ggplot(data_tayt, aes(x=event_date)) +
geom_line(aes(y = sold_count), color = "red") + ggtitle("Trend of Leggings Sales") + xlab("Date") + ylab("Sales")
We check for a day-of-week pattern; we do not have enough data to check monthly or yearly seasonality. We expect a frequency of 7 days, and to verify the seasonality we inspect the autocorrelation.
acf(data_tayt$sold_count, main= "Daily Autocorrelation")
pacf(data_tayt$sold_count, main= "Daily Partial Autocorrelation")
There is a peak at lag 16, which we will use in a decomposition. We will also check a 7-day decomposition and compare the two based on their random components.
In order not to lose information, we do not delete the NAs, i.e. the days with 0 sales.
Although there is no clear peak at lag 7, a day-of-week pattern is plausible, so we first decompose with a weekly frequency. As the variance changes with the level of the series, we use a multiplicative decomposition. Since there is a peak at lag 16, we will also try a decomposition with frequency 16.
tayt <- data_tayt
taytts <- ts(tayt$sold_count,freq=7)
data_mult<-decompose(taytts,type="multiplicative")
random=data_mult$random
plot(data_mult)
Sales peak in November: leggings may be preferred in autumn, and the low prices in that period may also have contributed. The trend component of the decomposition seems to capture the overall movement.
Examining the detrended part, we see a stationary series with constant mean and variance.
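The reasoning behind the multiplicative choice can be illustrated with a small simulation (assumed data, not the product series): when the seasonal swing and noise scale with the level, the spread of the additive remainder grows over time, while the multiplicative remainder stays stable:

```r
set.seed(3)
level  <- seq(50, 500, length.out = 140)                    # rising level
season <- rep(c(0.8, 1.1, 1.2, 1.15, 0.95, 0.9, 0.9), 20)  # weekly pattern
y <- ts(level * season * exp(rnorm(140, sd = 0.1)), frequency = 7)
r_add <- na.omit(decompose(y, type = "additive")$random)
r_mul <- na.omit(decompose(y, type = "multiplicative")$random)
c(add_early = sd(head(r_add, 60)), add_late = sd(tail(r_add, 60)))
c(mul_early = sd(head(r_mul, 60)), mul_late = sd(tail(r_mul, 60)))
```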
It is hard to identify the day-of-week pattern in this plot, so we take a closer look.
plot(data_mult$seasonal[1:7], xlab = "Day of Week", ylab = "Multiplicative Day Effect", main = "Seasonal Component for Multiplicative Decomposition")
To compare the seasonal multipliers with the mean sales of each day, we examine the daily means:
mean_sold=tayt[,list(mean_sold = mean(sold_count,na.rm=T)), by="wday"]
mean_sold
## wday mean_sold
## 1: Mon 811.1379
## 2: Tue 940.0000
## 3: Wed 1034.5345
## 4: Thu 1039.1897
## 5: Fri 696.5789
## 6: Sat 580.2807
## 7: Sun 608.4035
The seasonal component seems to reflect the actual daily means.
To check the stationarity of the detrended part, we apply the KPSS test:
unt_test=ur.kpss(data_mult$random)
summary(unt_test)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.1184
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The random component is stationary: the test statistic 0.1184 is below even the 10% critical value of 0.347, so the p-value is larger than 10% and we fail to reject the null hypothesis of stationarity.
We also check the additive decomposition to see whether it performs better.
taytts <- ts(tayt$sold_count,freq=7)
data_add<-decompose(taytts,type="additive")
random=data_add$random
plot(data_add)
The trend component of the decomposition seems to capture the overall trend well.
Examining the detrended part, we see a stationary series with constant mean and variance. Compared to the multiplicative model, the mean is smaller and the outlier peak is visible in the detrended part.
It is hard to identify the day-of-week pattern in this plot, so we take a closer look.
plot(data_add$seasonal[1:7], xlab = "Day of Week", ylab = "Additive Day Effect", main = "Seasonal Component for Additive Decomposition")
The seasonal component seems to reflect the daily mean values and is quite similar to the multiplicative decomposition results.
To check the stationarity of the detrended part, we apply the KPSS test:
unt_test=ur.kpss(data_add$random)
summary(unt_test)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0063
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The random component is stationary: 0.0063 is below even the 10% critical value of 0.347, so we fail to reject the null hypothesis of stationarity. This statistic is also smaller than the multiplicative decomposition's, so we continue with the additive decomposition.
Since the autocorrelation is high at lag 16, we also examine the data at that period. We again use the additive decomposition, as it performed better at the 7-day frequency.
taytts <- ts(tayt$sold_count,freq=16)
data_add_2<-decompose(taytts,type="additive")
plot(data_add_2)
Around the peak period, the trend seems to catch the decrease with a delay. The random part looks similar to that of the other model.
It is hard to identify the seasonal pattern in this plot, so we take a closer look.
plot(data_add_2$seasonal[1:16], xlab = "Day of 16-Day Cycle", ylab = "Additive Day Effect", main = "Seasonal Component for Additive Decomposition")
The seasonal component peaks around the Tuesdays and Wednesdays of the second week; however, with a 16-day period we lose track of the actual days of the week in this decomposition.
To check the stationarity of the detrended part, we apply the KPSS test:
unt_test=ur.kpss(data_add_2$random)
summary(unt_test)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0056
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The random part is also stationary; however, we continue the analysis with the 7-day decomposition.
acf(random, na.action = na.pass, main= "Detrend's Autocorrelation")
pacf(random, na.action = na.pass, main= "Detrend's Partial Autocorrelation")
The autocorrelation plot decays gradually, and the last significant partial autocorrelation is at lag 4, so AR(4) is tried first.
model <- arima(random, order=c(4,0,0))
print(model)
##
## Call:
## arima(x = random, order = c(4, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 ar4 intercept
## 0.2668 -0.1144 -0.1979 -0.2597 -0.0114
## s.e. 0.0485 0.0494 0.0493 0.0484 18.6062
##
## sigma^2 estimated as 232696: log likelihood = -3016.65, aic = 6045.3
The partial autocorrelation plot also decays, and there is significant autocorrelation up to lag 4, so we also try MA(4).
model <- arima(random, order=c(0,0,4))
print(model)
##
## Call:
## arima(x = random, order = c(0, 0, 4))
##
## Coefficients:
## ma1 ma2 ma3 ma4 intercept
## 0.0451 -0.3005 -0.4005 -0.3441 0.0780
## s.e. 0.0461 0.0457 0.0431 0.0512 0.6099
##
## sigma^2 estimated as 202322: log likelihood = -2991.02, aic = 5994.03
As its AIC (5994.03) is lower than that of AR(4) (6045.3), this model is better.
Next, we try their combination and its neighbours to decide which model to use.
model <- arima(random, order=c(4,0,4))
print(model)
##
## Call:
## arima(x = random, order = c(4, 0, 4))
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ar2 ar3 ar4 ma1 ma2 ma3 ma4
## 0.2376 0.1942 -0.0391 -0.401 -0.244 -0.5604 -0.3554 0.1598
## s.e. NaN 0.5659 NaN NaN NaN 0.5899 NaN NaN
## intercept
## 0.0598
## s.e. 0.3341
##
## sigma^2 estimated as 183334: log likelihood = -2972.15, aic = 5964.3
This is the lowest AIC so far, and thus the best model yet. Trying the neighbours:
model <- arima(random, order=c(3,0,4))
print(model)
##
## Call:
## arima(x = random, order = c(3, 0, 4))
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ar2 ar3 ma1 ma2 ma3 ma4 intercept
## 0.6368 0.6023 -0.6433 -0.7536 -1.0818 0.5753 0.2601 0.0704
## s.e. NaN NaN NaN NaN NaN NaN 0.0237 0.0767
##
## sigma^2 estimated as 171534: log likelihood = -2960.91, aic = 5939.82
Reducing the AR order to 3 improved the model. Reducing the MA order instead, we check (4,0,3):
modelf <- arima(random, order=c(4,0,3))
print(modelf)
##
## Call:
## arima(x = random, order = c(4, 0, 3))
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ar2 ar3 ar4 ma1 ma2 ma3 intercept
## 0.9752 0.3597 -0.8759 0.2881 -1.0612 -0.7895 0.8508 0.074
## s.e. NaN NaN NaN 0.0329 NaN NaN NaN 0.073
##
## sigma^2 estimated as 168325: log likelihood = -2957.1, aic = 5932.2
This yields the lowest AIC, so we continue with ARIMA(4,0,3).
To identify possible regressors for the ARIMA model, we examine the correlations with both the detrended series and sold_count.
Lagged versions of the variables will then be added to the model, since at prediction time the regressor values must already be available.
tayt <- tayt[, random := data_add$random]
numeric_data <- tayt
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame': 397 obs. of 12 variables:
## $ price : num 45 45 45 45 45 ...
## $ sold_count : num 366 1188 1162 895 649 ...
## $ visit_count : num 0 0 0 0 0 0 0 0 0 0 ...
## $ basket_count : num 2293 6379 5640 4200 3084 ...
## $ favored_count : num 0 0 0 0 0 0 0 0 0 0 ...
## $ category_sold : num 836 1581 1530 1353 868 ...
## $ category_visits : num 7436 7736 9176 8896 7154 ...
## $ category_basket : num 0 0 0 0 0 0 0 0 0 0 ...
## $ category_favored : num 35389 36277 42223 38858 30893 ...
## $ category_brand_sold: num 0 0 0 0 0 0 0 0 0 0 ...
## $ ty_visits : num 1 1 1 1 1 1 1 1 1 1 ...
## $ random : num -563 604 665 316 -162 ...
## - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)
The variables most correlated with the series are basket_count and category_sold. To use them in the regression, we add their lagged versions.
tayt <- tayt[,category_sold_lag := shift(category_sold, 2)]
tayt <- tayt[,basket_count_lag := shift(basket_count, 2)]
numeric_data <- tayt
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame': 397 obs. of 14 variables:
## $ price : num 45 45 45 45 45 ...
## $ sold_count : num 366 1188 1162 895 649 ...
## $ visit_count : num 0 0 0 0 0 0 0 0 0 0 ...
## $ basket_count : num 2293 6379 5640 4200 3084 ...
## $ favored_count : num 0 0 0 0 0 0 0 0 0 0 ...
## $ category_sold : num 836 1581 1530 1353 868 ...
## $ category_visits : num 7436 7736 9176 8896 7154 ...
## $ category_basket : num 0 0 0 0 0 0 0 0 0 0 ...
## $ category_favored : num 35389 36277 42223 38858 30893 ...
## $ category_brand_sold: num 0 0 0 0 0 0 0 0 0 0 ...
## $ ty_visits : num 1 1 1 1 1 1 1 1 1 1 ...
## $ random : num -563 604 665 316 -162 ...
## $ category_sold_lag : num 953 777 836 1581 1530 ...
## $ basket_count_lag : num 2226 1754 2293 6379 5640 ...
## - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)
The lagged basket_count has the highest correlation, so we start with it.
reg_matrix=cbind( tayt$basket_count_lag) # can add more if any other regressors exist
model1 <- arima(tayt$random,order = c(4,0,3), xreg = reg_matrix)
summary(model)
##
## Call:
## arima(x = random, order = c(3, 0, 4))
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ar2 ar3 ma1 ma2 ma3 ma4 intercept
## 0.6368 0.6023 -0.6433 -0.7536 -1.0818 0.5753 0.2601 0.0704
## s.e. NaN NaN NaN NaN NaN NaN 0.0237 0.0767
##
## sigma^2 estimated as 171534: log likelihood = -2960.91, aic = 5939.82
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -4.901652 414.1668 227.7179 97.47823 154.436 0.7217732 0.02486137
The AIC remains essentially the same; the model neither improved nor worsened.
We add the other potential regressor to the model.
reg_matrix=cbind(tayt$category_sold_lag)
model1 <- arima(tayt$random,order = c(4,0,3), xreg = reg_matrix)
summary(model1)
##
## Call:
## arima(x = tayt$random, order = c(4, 0, 3), xreg = reg_matrix)
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ar2 ar3 ar4 ma1 ma2 ma3 intercept
## 0.9173 0.4814 -0.9728 0.3166 -1.0111 -0.9191 0.9305 -0.844
## s.e. NaN NaN NaN NaN NaN NaN NaN NaN
## reg_matrix
## 4e-04
## s.e. NaN
##
## sigma^2 estimated as 165884: log likelihood = -2954.36, aic = 5928.72
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 8.982585 407.2884 226.2653 103.9454 162.7711 0.7171692 -0.01527941
The AIC decreased, so we can use the lagged category_sold variable.
To compare the ARIMA fits with and without the lagged regressor on the same graph:
model_fitted <- tayt$random - residuals(model1)
tayt <- cbind(tayt, data_add$seasonal, data_add$trend, model_fitted)
tayt <-tayt[,predictarima := data_add$seasonal + data_add$trend + model_fitted ]
model_fitted_2 <- tayt$random - residuals(modelf)
tayt <- cbind(tayt,model_fitted_2)
tayt <-tayt[,predictarima_no_reg := data_add$seasonal + data_add$trend + model_fitted_2 ]
ggplot(tayt, aes(x=event_date)) +
geom_line(aes(y = sold_count, color = "sold_count")) +
geom_line(aes(y = predictarima, color = "arima_with_lag")) +
geom_line(aes(y = predictarima_no_reg, color = "only_arima_prediction")) + ggtitle("Trend of Leggings Sales") + xlab("Date") + ylab("Sales")
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 6 row(s) containing missing values (geom_path).
The two models yield quite similar results, and both seem to capture the trend except at the peak points.
Testing with the regressor:
train_start=as.Date('2021-01-23')
test_start=as.Date('2021-06-23')
test_end=as.Date('2021-06-30')
test_dates=seq(test_start,test_end,by='day')
for(i in 1:length(test_dates)){
  current_date=test_dates[i]-2
  past_data <- tayt[event_date<=current_date,]
  tayt_ts <- ts(past_data$sold_count, frequency = 7)
  tayt_decomposed <- decompose(tayt_ts, type="additive")
  model <- arima(tayt_decomposed$random,order = c(4,0,3),xreg = past_data$basket_count_lag)
  forecasted=predict(model,n.ahead = 2,newxreg = tayt[event_date==test_dates[i],basket_count_lag])
  tayt[nrow(tayt)-length(test_dates)+i-2, Model_reg := forecasted$pred[2]+tayt_decomposed$seasonal[(nrow(tayt)-length(test_dates)+i-2)%%7+7]+tayt_decomposed$trend[max(which(!is.na(tayt_decomposed$trend)))]]
}
m_with_7<-accu(tayt$sold_count[(nrow(tayt)-1-length(test_dates)):(nrow(tayt)-2)],(tayt$Model_reg[(nrow(tayt)-1-length(test_dates)):(nrow(tayt)-2)]))
For the data with no regressors:
for(i in 1:length(test_dates)){
  current_date=test_dates[i]-2
  past_data <- tayt[event_date<=current_date,]
  tayt_ts <- ts(past_data$sold_count, frequency =7)
  tayt_decomposed <- decompose(tayt_ts, type="additive")
  model <- arima(tayt_decomposed$random,order = c(4,0,3))
  forecasted=predict(model,n.ahead = 2)
  tayt[nrow(tayt)-length(test_dates)+i-2, Model_noreg := forecasted$pred[2]+tayt_decomposed$seasonal[(nrow(tayt)-length(test_dates)+i-2)%%7+7]+tayt_decomposed$trend[max(which(!is.na(tayt_decomposed$trend)))]]
}
m_with_no_reg<-accu(tayt$sold_count[(nrow(tayt)-1-length(test_dates)):(nrow(tayt)-2)],(tayt$Model_noreg[(nrow(tayt)-1-length(test_dates)):(nrow(tayt)-2)]))
rbind(m_with_7,m_with_no_reg)
## n mean sd CV FBias MAPE RMSE MAD MADP
## 1 8 257 42.77182 0.1664273 -0.19691640 0.6108043 157.2306 151.4369 0.5892486
## 2 8 257 42.77182 0.1664273 -0.09805438 0.5866824 153.3859 146.6330 0.5705563
## WMAPE
## 1 0.5892486
## 2 0.5705563
ARIMA with no external regressors performed better: the performance measures WMAPE, MAPE and MAD are all lower for the model without regressors.
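`accu` is a course-provided helper whose code is not shown here; a hedged re-implementation of some of the reported measures, assuming their standard definitions (`accu_sketch` is a hypothetical stand-in, not the original function):

```r
accu_sketch <- function(actual, forecast) {
  e <- actual - forecast
  c(FBias = sum(e) / sum(actual),   # signed relative bias
    MAPE  = mean(abs(e) / actual),
    RMSE  = sqrt(mean(e^2)),
    MAD   = mean(abs(e)),
    WMAPE = sum(abs(e)) / sum(actual))
}
accu_sketch(c(100, 200, 300), c(110, 190, 330))
```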
In this homework, we first examined the possible seasonalities of each product. Since there are few data points, we mostly examined day-of-week patterns and tried to explain their possible causes, such as weekends and people's shopping tendencies. After decomposing the data, we built AR, MA and ARIMA models, added regressors to the ARIMA models, and finally tested them over a one-week period using two-day-ahead forecasts.